Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition
Abstract
This paper describes a logarithmic version of the conventional Naïve Bayesian N-gram-based text classification algorithm, which we name Cumulative Frequency Addition (CFA), and its application to three tasks: language identification, nationality identification from names, and detection of foreign words in base text. The new CFA technique is 3-10 times faster than N-gram-based rank-order statistical classifiers. In the language identification task, CFA yields 100% accuracy on strings longer than 150 characters. In the name-to-nationality task, it yields 86% accuracy on a 14-country database and 96% on a 7-country database within the top three choices. Finally, in the task of detecting foreign words, it yields 66.9% accuracy. This is the first study to apply natural language processing techniques to such tasks as name identification and foreign word detection.
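The abstract does not spell out the scoring rule, so the following is only a minimal sketch of the general idea: a log-space Naïve Bayes style classifier that adds up logarithms of character N-gram frequencies for each candidate language and picks the largest sum. The trigram length, the `floor` value used for unseen N-grams, the toy training strings, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Return overlapping character n-grams of text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build per-label relative n-gram frequencies from {label: text}."""
    models = {}
    for label, text in samples.items():
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        models[label] = {g: c / total for g, c in counts.items()}
    return models

def classify(text, models, n=3, floor=1e-6):
    """Sum log frequencies per label; unseen n-grams get an assumed floor."""
    scores = {
        label: sum(math.log(freqs.get(g, floor)) for g in ngrams(text, n))
        for label, freqs in models.items()
    }
    return max(scores, key=scores.get), scores

def foreign_words(sentence, models, base, n=3):
    """Flag words whose best-scoring label differs from the base language."""
    return [w for w in sentence.split() if classify(w, models, n)[0] != base]

# Toy training data (hypothetical; a real system needs large corpora).
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
MODELS = train(TOY_SAMPLES)

if __name__ == "__main__":
    print(classify("the dog", MODELS)[0])
    print(foreign_words("the perro", MODELS, base="english"))
```

Adding logarithms rather than multiplying raw probabilities avoids numeric underflow on long strings, which is the usual motivation for a logarithmic formulation. The same scoring function serves all three tasks named in the abstract once the labels are languages, nationalities, or a base language versus others.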
Similar resources
Language Identification from Text Using N-gram Based Cumulative Frequency Addition
This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-...
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in the field of natural language processing. It is also known as a subset of information extraction. The process of recognizing named entities aims at finding proper nouns in the text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...
An unsupervised method for identifying loanwords in Korean
This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming trace...
Capturing Out-of-Vocabulary Words in Arabic Text
The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, a...
Comparing Lexical Bundles in Hard Science Lectures; A Case of Native and Non-Native University Lecturers
Researchers have stated that when non-native lecturers learn and apply a certain set of lexical bundles used by native lecturers, students improve their proficiency through incidental vocabulary input. The present study sheds light on the lexical bundles in hard science lectures used by native and non-native lecturers in international universities with the main purpose of analyzing the structu...
Publication date: 2005